RQ: What risk factors are associated with serious thoughts of suicide?

Data: `HELPstudy` (derived from `HELPrct`)

- `suicide` (response; {"no", "yes"}): experienced serious thoughts of suicide in the last 30 days (measured at baseline)
- `cesd`: Center for Epidemiologic Studies Depression measure at baseline (high scores indicate more depressive symptoms)
- `sex`: male or female
- `avgAlcohol`: average number of drinks (standard units) consumed per day in the past 30 days (measured at baseline)
- `pss_fr`: perceived social support by friends (measured at baseline; higher scores indicate more support)
- `treat`: randomized to HELP clinic (no/yes)
```r
HELPstudy <-
HELPrct %>%
select(suicide = g1b, age, cesd, sex, avgAlcohol = i1, pss_fr)
# inspect data
glimpse(HELPstudy)
Observations: 453
Variables: 6
$ suicide <fct> yes, yes, no, no, no, no, yes, yes, no, no, no, no, no, no, no, yes, no, yes, yes, no, no, …
$ age <int> 37, 37, 26, 39, 32, 47, 49, 28, 50, 39, 34, 58, 58, 60, 36, 28, 35, 29, 27, 27, 41, 33, 34,…
$ cesd <int> 49, 30, 39, 15, 39, 6, 52, 32, 50, 46, 46, 49, 22, 36, 43, 35, 19, 40, 52, 37, 35, 18, 36, …
$ sex <fct> male, male, male, female, male, female, female, male, female, male, female, female, male, m…
$ avgAlcohol <int> 13, 56, 0, 5, 10, 4, 13, 12, 71, 20, 0, 13, 20, 13, 51, 0, 0, 1, 9, 23, 26, 0, 34, 4, 6, 3,…
$ pss_fr <int> 0, 1, 13, 11, 10, 5, 1, 4, 5, 0, 0, 13, 13, 1, 1, 7, 9, 1, 13, 11, 8, 14, 10, 6, 6, 3, 6, 4…
```r
```r
# set RNG seed for reproducible results
set.seed(538)
# partition the data
n <- nrow(HELPstudy)
test_idx <- sample.int(n, size = round(0.25 * n)) # select row numbers for the test set
train <- HELPstudy[-test_idx, ] # exclude the test set cases
nrow(train)
[1] 340
```r
```r
test <- HELPstudy[test_idx, ] # test set cases only
nrow(test)
[1] 113
- use the `train` set for model building
- hold out the `test` set as a means to validate model accuracy

EDA of the `HELPstudy` (training) data:
```r
# short (insufficient) EDA
tally(~ suicide, data = train)
suicide
no yes
247 93
```r
```r
favstats(~ age, data = train)

min | Q1 | median | Q3 | max | mean | sd | n | missing
---|---|---|---|---|---|---|---|---
19 | 30 | 34 | 40 | 59 | 35.35588 | 7.499054 | 340 | 0

favstats(~ cesd, data = train)

min | Q1 | median | Q3 | max | mean | sd | n | missing
---|---|---|---|---|---|---|---|---
1 | 24 | 34 | 41 | 60 | 32.50294 | 12.76275 | 340 | 0
```r
```r
tally(~ sex, data = train)
sex
female male
85 255
```r
```r
favstats(~ avgAlcohol, data = train)

min | Q1 | median | Q3 | max | mean | sd | n | missing
---|---|---|---|---|---|---|---|---
0 | 3 | 13 | 26 | 134 | 17.17059 | 18.76906 | 340 | 0
```r
```r
favstats(~ pss_fr, data = train)

min | Q1 | median | Q3 | max | mean | sd | n | missing
---|---|---|---|---|---|---|---|---
0 | 3 | 7 | 10 | 14 | 6.802941 | 4.010237 | 340 | 0
```r
```r
# single plot; ugly, but possibly useful (you can do better)
pairs(train)
```r
```r
mod_null <- tally(~ suicide, data = train, format = "percent")
mod_null
suicide
no yes
72.65 27.35
First, let's try to predict `suicide` from just `cesd` (a depression score).
```r
require(rpart)
rpart(suicide ~ cesd, data = train)
n= 340
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 340 93 no (0.7265 0.2735)
2) cesd< 34.5 183 24 no (0.8689 0.1311) *
3) cesd>=34.5 157 69 no (0.5605 0.4395)
6) cesd< 54.5 147 60 no (0.5918 0.4082)
12) cesd>=45.5 46 14 no (0.6957 0.3043) *
13) cesd< 45.5 101 46 no (0.5446 0.4554)
26) cesd< 41.5 78 31 no (0.6026 0.3974) *
27) cesd>=41.5 23 8 yes (0.3478 0.6522) *
7) cesd>=54.5 10 1 yes (0.1000 0.9000) *
```r
```r
split <- 34.5 # first split from simple `rpart`
train %>%
mutate(elevated_depression = cesd >= split) %>%
ggplot(aes(x = cesd, y = suicide)) +
geom_point(aes(color = elevated_depression),
position = position_jitter(width = 0, height = 0.15),
alpha = 0.4) +
  geom_vline(xintercept = split, color = "blue", lty = 2)
We could store the model formula as an object, e.g. `form <- as.formula("suicide ~ age + cesd + sex + avgAlcohol + pss_fr")`, but the shorthand `suicide ~ .` works here since we're using all of the other variables in `train`.
```r
```r
mod_tree <- rpart(suicide ~ ., data = train)
mod_tree
n= 340
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 340 93 no (0.7265 0.2735)
2) cesd< 34.5 183 24 no (0.8689 0.1311) *
3) cesd>=34.5 157 69 no (0.5605 0.4395)
6) cesd< 54.5 147 60 no (0.5918 0.4082)
12) avgAlcohol< 25.5 102 34 no (0.6667 0.3333)
24) age>=33.5 49 10 no (0.7959 0.2041) *
25) age< 33.5 53 24 no (0.5472 0.4528)
50) avgAlcohol< 2.5 19 6 no (0.6842 0.3158) *
51) avgAlcohol>=2.5 34 16 yes (0.4706 0.5294)
102) cesd< 41.5 24 10 no (0.5833 0.4167) *
103) cesd>=41.5 10 2 yes (0.2000 0.8000) *
13) avgAlcohol>=25.5 45 19 yes (0.4222 0.5778)
26) pss_fr>=8.5 13 4 no (0.6923 0.3077) *
27) pss_fr< 8.5 32 10 yes (0.3125 0.6875)
54) sex=male 22 10 yes (0.4545 0.5455)
108) pss_fr>=1.5 14 6 no (0.5714 0.4286) *
109) pss_fr< 1.5 8 2 yes (0.2500 0.7500) *
55) sex=female 10 0 yes (0.0000 1.0000) *
7) cesd>=54.5 10 1 yes (0.1000 0.9000) *
```r
```r
require(partykit)
plot(as.party(mod_tree))
`printcp()` prints a table of optimal prunings based on the complexity parameter.
```r
printcp(mod_tree)
Classification tree:
rpart(formula = suicide ~ ., data = train)
Variables actually used in tree construction:
[1] age avgAlcohol cesd pss_fr sex
Root node error: 93/340 = 0.27
n= 340
CP nsplit rel error xerror xstd
1 0.043 0 1.00 1.0 0.088
2 0.022 4 0.78 1.0 0.089
3 0.011 7 0.72 1.0 0.089
4 0.010 9 0.70 1.1 0.090
```r
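As an aside, instead of refitting with a different control parameter (as done below), we can prune the tree we already have at a chosen complexity parameter; a minimal sketch (the `cp` value here is just for illustration, picked by eye from the `printcp()` table):

```r
# prune the existing tree back to a coarser fit rather than refitting from scratch
mod_tree_pruned <- prune(mod_tree, cp = 0.03)
plot(as.party(mod_tree_pruned))
```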
```r
train_tree <-
train %>%
  mutate(suicide_dtree = predict(mod_tree, type = "class"))
confusion <- tally(suicide_dtree ~ suicide, data = train_tree, format = "count")
confusion
suicide
suicide_dtree no yes
no 242 60
yes 5 33
```r
```r
# decision tree model accuracy
dtree_acc <- sum(diag(confusion)) / nrow(train) * 100
dtree_acc
[1] 80.88
```r
```r
# different control parameter for comparison
mod_tree2 <- rpart(suicide ~ ., data = train, control = rpart.control(cp = 0.03))
plot(as.party(mod_tree2))
```r
```r
train_tree2 <-
train %>%
  mutate(suicide_dtree = predict(mod_tree2, type = "class"))
confusion <- tally(suicide_dtree ~ suicide, data = train_tree2, format = "count")
confusion
suicide
suicide_dtree no yes
no 236 62
yes 11 31
```r
```r
sum(diag(confusion)) / nrow(train) * 100
[1] 78.53
Random forests:

- we choose the number of trees to build (`ntree` argument) AND the number of variables to consider in each tree (`mtry` argument)
- each tree considers a random sample of `mtry` of the explanatory variables at each split
- this is repeated `ntree` times
- `ntree` shouldn't be too small... but a large forest can be computationally intensive
- `mtry` defaults to √p for classification trees and p/3 for regression trees
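To see where the default lands for this problem, a quick check (assuming p = 5 explanatory variables, as in `train`):

```r
# default mtry for a classification forest is floor(sqrt(p))
p <- ncol(train) - 1   # 5 explanatory variables
floor(sqrt(p))         # = 2, the value used below
```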
```r
require(randomForest)
mod_forest <- randomForest(suicide ~ ., data = train, ntree = 2000, mtry = 2)  # MDSR book says mtry = 3; think about why 3 doesn't work...
mod_forest
Call:
randomForest(formula = suicide ~ ., data = train, ntree = 2000, mtry = 2)
Type of random forest: classification
Number of trees: 2000
No. of variables tried at each split: 2
OOB estimate of error rate: 28.82%
Confusion matrix:
no yes class.error
no 213 34 0.1377
yes 64 29 0.6882
```r
```r
# note the built-in confusion matrix
rf_acc <- sum(diag(mod_forest$confusion)) / nrow(train) * 100
rf_acc
[1] 71.18
The depression score (`cesd`) appears most important, and `sex` least important.
```r
require(tibble)
importance(mod_forest) %>%
as.data.frame() %>%
rownames_to_column() %>% # handy function from `tibble` package
arrange(desc(MeanDecreaseGini))
rowname | MeanDecreaseGini
---|---
cesd | 43.913571
avgAlcohol | 32.542581
age | 28.755754
pss_fr | 21.887942
sex | 5.298385
```r
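The same information is available as a built-in dotchart (a quick sketch):

```r
# variable importance plot from the fitted forest
varImpPlot(mod_forest)
```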
```r
require(gbm)
# response must be converted to 0/1
train_boost <-
train %>%
  mutate(suicide01 = if_else(suicide == "yes", 1, 0)) %>%
  select(-suicide)
mod_boost <- gbm(suicide01 ~ ., distribution = "bernoulli", # because it's a classification model
data = train_boost, n.trees = 3000, interaction.depth = 2)
# the relative influence is similar to importance result earlier
msummary(mod_boost)
var | rel.inf
---|---
cesd | 33.503692
age | 25.003537
avgAlcohol | 24.544282
pss_fr | 14.105824
sex | 2.842664
(again, note the small relative influence of `sex`)

image credit: James et al (2013) http://www-bcf.usc.edu/~gareth/ISL/
For the k-nearest-neighbors classifier, use `cl = train$suicide` as the supervising variable (from `train`, not `train_quant`).
```r
require(class)
train_quant <-
train %>%
select(age, cesd, avgAlcohol, pss_fr)
# KNN classifier
suicide_knn <- knn(train = train_quant, test = train_quant, cl = train$suicide, k = 5)
# confusion matrix
confusion <- tally(suicide_knn ~ suicide, data = train, format = "count")
confusion
suicide
suicide_knn no yes
no 232 55
yes 15 38
```r
```r
knn_acc <- sum(diag(confusion)) / nrow(train) * 100
knn_acc
[1] 79.41
```r
```r
knn_error_rate <- function(x, y, numNeighbors, z = x) {
# purpose: fit knn classifier and return proportion of misclassifications
# inputs:
### x: training data
  ### y: true classifications from the data
### numNeighbors: number of nearest neighbors for classifier
### z: test data (default to training data)
y_hat <- knn(train = x, test = z, cl = y, k = numNeighbors)
return(sum(y_hat != y) / nrow(x))
}
ks <- c(1:15, 20, 25, 30, 40, 50)
train_rates <- sapply(ks, FUN = knn_error_rate, x = train_quant, y = train$suicide)
knn_error_rates <- data.frame(k = ks,
train_rate = train_rates)
knn_error_rates %>%
ggplot(aes(x = k, y = train_rate)) +
geom_point() +
geom_line() +
  ylab("Misclassification Rate")
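# The z argument lets the same helper score new data, but then misclassifications
# should be counted against the labels of that data rather than y. A small variant
# (the helper name knn_test_error_rate and the object test_quant are illustrative,
# not part of the original notes):
knn_test_error_rate <- function(x, y, numNeighbors, z, z_labels) {
  y_hat <- knn(train = x, test = z, cl = y, k = numNeighbors)
  sum(y_hat != z_labels) / nrow(z)
}
test_quant <- select(test, age, cesd, avgAlcohol, pss_fr)
test_rates <- sapply(ks, FUN = knn_test_error_rate, x = train_quant,
                     y = train$suicide, z = test_quant, z_labels = test$suicide)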
image credit: James et al (2013) http://www-bcf.usc.edu/~gareth/ISL/
Bayes' rule:

$$\Pr(y \mid x) = \frac{\Pr(x \cap y)}{\Pr(x)} = \frac{\Pr(x \mid y)\,\Pr(y)}{\Pr(x)}$$

For example, to find Pr(`suicide` = "yes" | `sex` = "male"), take $y$ to be `suicide == "yes"` and $x$ to be `sex == "male"`.
```r
```r
head(train, 1)
tally(sex ~ suicide, data = train, margins = TRUE)
suicide
sex no yes
female 53 32
male 194 61
Total 247 93
```r
```r
tally( ~ sex, data = train, margins = TRUE)
sex
female male Total
85 255 340
```r
```r
tally( ~ suicide, data = train, margins = TRUE)
suicide
no yes Total
247 93 340
One row of the training data (the output of `head(train, 1)` above):

suicide | age | cesd | sex | avgAlcohol | pss_fr
---|---|---|---|---|---
no | 26 | 39 | male | 0 | 13

Using Bayes' rule with the tallies above, Pr(`suicide` = "yes" | `sex` = "male") is

$$\Pr(\text{suicide} \mid \text{male}) = \frac{\Pr(\text{male} \mid \text{suicide})\,\Pr(\text{suicide})}{\Pr(\text{male})} = \frac{(61/93)(93/340)}{(255/340)} = \frac{0.18}{0.75} = 0.24$$
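The same arithmetic in R, pulled straight from the tallies above (just a check of the hand computation):

```r
# Pr(suicide = "yes" | sex = "male") via Bayes' rule, using the tallied counts
p_male_given_yes <- 61 / 93    # Pr(male | suicide = "yes")
p_yes            <- 93 / 340   # Pr(suicide = "yes")
p_male           <- 255 / 340  # Pr(male)
p_male_given_yes * p_yes / p_male   # about 0.24 (equivalently, 61/255)
```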
A naive Bayes classifier takes all of the explanatory variables (not just `sex`) into consideration.
```r
require(e1071) # awful name for a package...
mod_nb <- naiveBayes(suicide ~ ., data = train)
suicide_nb <- predict(mod_nb, newdata = train)
confusion <- tally(suicide_nb ~ suicide, data = train, format = "count")
confusion
suicide
suicide_nb no yes
no 224 62
yes 23 31
```r
```r
nb_acc <- sum(diag(confusion)) / nrow(train) * 100
nb_acc
[1] 75
```r
```r
require(nnet)
mod_nnet <- nnet(suicide ~ ., data = train, size = 8)
# weights: 57
initial value 230.113258
iter 10 value 177.092442
iter 20 value 169.786429
iter 30 value 165.360178
iter 40 value 163.286366
iter 50 value 160.656488
iter 60 value 156.073589
iter 70 value 153.980013
iter 80 value 153.830620
final value 153.830466
converged
```r
```r
suicide_nn <- predict(mod_nnet, newdata = train, type = "class")
confusion <- tally(suicide_nn ~ suicide, data = train, format = "count")
confusion
suicide
suicide_nn no yes
no 236 59
yes 11 34
```r
```r
nnet_acc <- sum(diag(confusion)) / nrow(train) * 100
nnet_acc
[1] 79.41
```r
```r
# Plotting artificial neural networks (no plotting function in `nnet` package)
# import function from Github
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
```r
```r
# plot ANN
plot.nnet(mod_nnet)
Loading required package: reshape
there is no package called ‘reshape’
```r
```r
# logistic
mod_logit <- glm(suicide ~ ., data = train, family = "binomial")
msummary(mod_logit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.60519 0.86254 -1.86 0.063 .
age -0.03458 0.01854 -1.86 0.062 .
cesd 0.06333 0.01259 5.03 4.9e-07 ***
sexmale -0.46303 0.30584 -1.51 0.130
avgAlcohol 0.01496 0.00691 2.17 0.030 *
pss_fr -0.04799 0.03408 -1.41 0.159
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.98 on 339 degrees of freedom
Residual deviance: 345.44 on 334 degrees of freedom
AIC: 357.4
Number of Fisher Scoring iterations: 4
```r
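The logistic regression coefficients above are on the log-odds scale; exponentiating gives odds ratios, which are often easier to interpret (a minimal sketch added for illustration; `confint.default()` gives Wald-type intervals):

```r
# odds ratios and Wald confidence intervals for the logistic model
exp(coef(mod_logit))
exp(confint.default(mod_logit))
```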
```r
suicide_logitProb <- predict(mod_logit, newdata = train, type = "response")
suicide_logit <- ifelse(suicide_logitProb > 0.5, yes = "yes", no = "no")
confusion <- tally(suicide_logit ~ suicide, data = train, format = "count")
confusion
suicide
suicide_logit no yes
no 229 69
yes 18 24
```r
```r
logit_acc <- sum(diag(confusion)) / nrow(train) * 100
logit_acc
[1] 74.41
image credit: James et al (2013) http://www-bcf.usc.edu/~gareth/ISL/
```r
```r
require(e1071) # again, awful name for a package...
# mod_svm = svm(suicide ~ ., data = train, kernel = "linear", cost = 10, scale = FALSE)
mod_svm = svm(suicide ~ ., data = train, kernel = "radial", cost = 10, scale = FALSE)
print(mod_svm)
Call:
svm(formula = suicide ~ ., data = train, kernel = "radial", cost = 10, scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
gamma: 0.1667
Number of Support Vectors: 340
```r
```r
suicide_svm <- predict(mod_svm, newdata = train)
confusion <- tally(suicide_svm ~ suicide, data = train, format = "count")
confusion
suicide
suicide_svm no yes
no 247 0
yes 0 93
```r
```r
svm_acc <- sum(diag(confusion)) / nrow(train) * 100
svm_acc
[1] 100
```r
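Every one of the 340 training cases is a support vector and the training accuracy is 100%, which suggests this unscaled radial-kernel fit is badly overfit. A quick sanity check against the held-out data (a sketch, not part of the original analysis):

```r
# evaluate the radial SVM on the test set to see whether the perfect
# training accuracy holds up out of sample
suicide_svm_test <- predict(mod_svm, newdata = test)
confusion_test <- tally(suicide_svm_test ~ suicide, data = test, format = "count")
confusion_test
sum(diag(confusion_test)) / nrow(test) * 100
```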
```r
# summary of results
ModelComparison <-
tribble(
~model, ~accuracy,
  "**NULL MODEL**", mod_null[1],
  "decision tree", dtree_acc,
  "random forest", rf_acc,
  "k-nearest neighbors", knn_acc,
  "naive Bayes", nb_acc,
  "neural network", nnet_acc,
  "logistic regression", logit_acc
)
ModelComparison %>%
arrange(desc(accuracy))
model | accuracy
---|---
decision tree | 80.88235
k-nearest neighbors | 79.41176
neural network | 79.41176
naive Bayes | 75.00000
logistic regression | 74.41176
**NULL MODEL** | 72.64706
random forest | 71.17647
Ensemble classifier: let the individual models vote, classifying a case as "yes" when at least `vote` of them predict "yes". What's a reasonable value for `vote`?
```r
vote <- 3
suicide_ensemble <-
  ifelse((suicide_knn == "yes") +
           (suicide_nn == "yes") +
           (suicide_nb == "yes") +
           (suicide_logit == "yes") +
           (mod_forest$predicted == "yes") >= vote,
         "yes", "no")
confusion <- tally(suicide_ensemble ~ suicide, data = train, format = "count")
confusion
suicide
suicide_ensemble no yes
no 235 62
yes 12 31
```r
```r
ens_acc <- sum(diag(confusion)) / nrow(train) * 100
# summary of results
ModelComparison <-
tribble(
~model, ~accuracy,
  "**NULL MODEL**", mod_null[1],
  "random forest", rf_acc,
  "k-nearest neighbors", knn_acc,
  "naive Bayes", nb_acc,
  "neural network", nnet_acc,
  "logistic regression", logit_acc,
  "**ENSEMBLE**", ens_acc
)
ModelComparison %>%
arrange(desc(accuracy))
model | accuracy
---|---
k-nearest neighbors | 79.41176
neural network | 79.41176
**ENSEMBLE** | 78.23529
naive Bayes | 75.00000
logistic regression | 74.41176
**NULL MODEL** | 72.64706
random forest | 71.17647
The `SuperLearner` package has some nice tools to do this directly. `SuperLearner` calls other packages corresponding to the available methods, and you need to install the package dependencies yourself to access those methods.
```r
require(SuperLearner)
listWrappers() # list available models in `SuperLearner`
All prediction algorithm wrappers in SuperLearner:
 [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"         "SL.caret"
 [5] "SL.caret.rpart"      "SL.cforest"          "SL.dbarts"           "SL.earth"
 [9] "SL.extraTrees"       "SL.gam"              "SL.gbm"              "SL.glm"
[13] "SL.glm.interaction"  "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"
[17] "SL.knn"              "SL.ksvm"             "SL.lda"              "SL.leekasso"
[21] "SL.lm"               "SL.loess"            "SL.logreg"           "SL.mean"
[25] "SL.nnet"             "SL.nnls"             "SL.polymars"         "SL.qda"
[29] "SL.randomForest"     "SL.ranger"           "SL.ridge"            "SL.rpart"
[33] "SL.rpartPrune"       "SL.speedglm"         "SL.speedlm"          "SL.step"
[37] "SL.step.forward"     "SL.step.interaction" "SL.stepAIC"          "SL.svm"
[41] "SL.template"         "SL.xgboost"
All screening algorithm wrappers in SuperLearner:
[1] "All"
[1] "screen.corP"           "screen.corRank"        "screen.glmnet"         "screen.randomForest"
[5] "screen.SIS"            "screen.template"       "screen.ttest"          "write.screen.template"
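A minimal sketch of a `SuperLearner` fit on these data (the choice of `SL.glm` and `SL.rpart` is just for illustration, and assumes those wrappers' dependencies are installed):

```r
# SuperLearner expects a numeric 0/1 response and a data frame of predictors
sl_fit <- SuperLearner(Y = ifelse(train$suicide == "yes", 1, 0),
                       X = select(train, -suicide),
                       family = binomial(),
                       SL.library = c("SL.glm", "SL.rpart"))
sl_fit   # shows the cross-validated risk and ensemble weight for each learner
```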
image credit: James et al (2013) http://www-bcf.usc.edu/~gareth/ISL/
Next, check the models against the held-out `test` data (tools in the `caret` package can also help with this):
```r
# k-Nearest Neighbor
test_knn <- knn(train = train_quant, cl = train$suicide, k = 5,
test = select(test, age, cesd, avgAlcohol, pss_fr))
confusion <- tally(test_knn ~ suicide, data = test, format = "count")
confusion # test data
suicide
test_knn no yes
no 68 23
yes 11 11
```r
```r
sum(diag(confusion)) / nrow(test)
[1] 0.6991
Conveniently, most fitted model objects have a `predict()` method associated with them, which makes it straightforward to generate test-set predictions from each model.
```r
# slight modification of our null model (intercept-only logit model)
mod_null <- glm(suicide ~ 1, data = train, family = "binomial")
# store our various models as a list
models <- list(mod_null, mod_tree, mod_tree2, mod_forest, mod_nnet, mod_nb, mod_logit)
# # models and various `predict()` methods available
# lapply(models, FUN = class)
# methods("predict")
ROC curves (from MDSR Figure 8.6, p. 191): our response is binary (`suicide`: {"yes", "no"}). In the `HELPrct` example, what do the two distributions represent?

image credit (2/13/2019): https://en.wikipedia.org/wiki/Receiver_operating_characteristic
```r
```r
require(ROCR)
# output arguments corresponding to the objects in the `models` list
outputs <- c("response", "prob", "prob", "prob", "raw", "raw", "response")
roc_test <- mapply(FUN = predict, models, type = outputs, MoreArgs = list(newdata = test)) %>%
  as.data.frame() %>%
  select(1, 3, 5, 7, 8, 10, 11)
names(roc_test) <- c("mod_null", "mod_tree", "mod_tree2", "mod_forest", "mod_nnet", "mod_nb", "mod_logit")
glimpse(roc_test)
Observations: 113
Variables: 7
$ mod_null <dbl> 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2735, 0.2…
$ mod_tree <dbl> 0.1311, 0.4286, 0.1311, 0.1311, 0.1311, 0.1311, 0.4167, 0.1311, 1.0000, 0.1311, 0.1311, 0.4…
$ mod_tree2 <dbl> 0.1311, 0.6875, 0.1311, 0.1311, 0.1311, 0.1311, 0.3333, 0.1311, 0.6875, 0.1311, 0.1311, 0.3…
$ mod_forest <dbl> 0.1635, 0.4665, 0.1430, 0.1660, 0.0210, 0.1960, 0.2450, 0.0305, 0.5415, 0.0535, 0.1835, 0.4…
$ mod_nnet <dbl> 0.72469, 0.00000, 0.04715, 0.12415, 0.04715, 0.18977, 0.69428, 0.04715, 0.72469, 0.18977, 0…
$ mod_nb <dbl> 0.48084, 0.97927, 0.14694, 0.43275, 0.05504, 0.11254, 0.20647, 0.01172, 0.81458, 0.06190, 0…
$ mod_logit <dbl> 0.34117, 0.55543, 0.15191, 0.33218, 0.09026, 0.13946, 0.24992, 0.03694, 0.65802, 0.09861, 0…
```r
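The next chunk uses a helper `get_roc()` that isn't defined in these notes (it follows MDSR); a sketch of one possible version built on `ROCR` is below. The `predictions_summary` data frame used for the plotted points is likewise assumed to have been computed elsewhere.

```r
# sketch of a get_roc()-style helper: given a data frame x with a column y_hat of
# predicted probabilities and the true labels y, return the ROC curve as
# false/true positive rates
get_roc <- function(x, y) {
  pred <- ROCR::prediction(predictions = x$y_hat, labels = y)
  perf <- ROCR::performance(pred, measure = "tpr", x.measure = "fpr")
  data.frame(fpr = perf@x.values[[1]], tpr = perf@y.values[[1]])
}
```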
```r
roc_tidy <-
roc_test %>%
  gather(key = "model", value = "y_hat") %>%
group_by(model) %>%
dplyr::do(get_roc(., y = test$suicide))
# plot ROC curves
roc_tidy %>%
ggplot(aes(x = fpr, y = tpr)) +
geom_line(aes(color = model)) +
geom_abline(intercept = 0, slope = 1, lty = 3) +
  ylab("True positive rate (sensitivity)") +
  xlab("False positive rate (1 - specificity)") +
geom_point(data = predictions_summary, size = 3,
aes(x = test_fpr, y = test_tpr, color = model))
image credit: James et al (2013) http://www-bcf.usc.edu/~gareth/ISL/
For a given test value, $x_0$:

$$E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)$$
Regularization via `glmnet::glmnet()`: set `alpha = 0` for ridge regression and `alpha = 1` for the Lasso.
```r
require(glmnet)
train_matrix <-
train_quant %>%
  mutate(male = ifelse(train$sex == "male", 1, 0)) %>%
  as.matrix()
# fit model
mod_lasso <- glmnet(x = train_matrix, y = train$suicide, family = "binomial", alpha = 1)
# plot coefficients as a function of the "penalty" imposed by the tuning parameter
plot(mod_lasso)
```r
```r
# choosing lambda by cross-validation
cv_result <- cv.glmnet(x = as.matrix(train_quant), y = train$suicide, family = "binomial", type.measure = "class")
# (note: this uses only the quantitative predictors; for consistency with mod_lasso
#  we could instead pass train_matrix, which also includes the male indicator)
plot(cv_result)
```r
```r
best_lambda <- cv_result$lambda.min
# lasso coefficient estimates (note the shrinkage toward zero relative to the unregularized logistic regression estimates below)
predict(mod_lasso, type = "coefficients", s = best_lambda)
6 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -2.186379
age -0.015495
cesd 0.054849
avgAlcohol 0.008302
pss_fr -0.023553
male -0.185568
```r
```r
# logistic regression coefficient estimates
msummary(mod_logit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.60519 0.86254 -1.86 0.063 .
age -0.03458 0.01854 -1.86 0.062 .
cesd 0.06333 0.01259 5.03 4.9e-07 ***
sexmale -0.46303 0.30584 -1.51 0.130
avgAlcohol 0.01496 0.00691 2.17 0.030 *
pss_fr -0.04799 0.03408 -1.41 0.159
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.98 on 339 degrees of freedom
Residual deviance: 345.44 on 334 degrees of freedom
AIC: 357.4
Number of Fisher Scoring iterations: 4
```r
```r
# training accuracy
suicide_lassoProb <- predict(mod_lasso, s = best_lambda, newx = train_matrix, type = "response")
suicide_lasso <- ifelse(suicide_lassoProb > 0.5, yes = "yes", no = "no")
confusion <- tally(suicide_lasso ~ suicide, data = train, format = "count")
confusion
suicide
suicide_lasso no yes
no 239 75
yes 8 18
```r
```r
lasso_acc <- sum(diag(confusion)) / nrow(train) * 100
# Logistic regression training accuracy with/without regularization
lasso_acc # Regularization (Lasso)
[1] 75.59
```r
```r
logit_acc # No regularization
[1] 74.41